Experimental evaluation of baselines for forecasting social media timeseries

Forecasting social media activity can be of practical use in many scenarios, from understanding trends, such as which topics are likely to engage more users in the coming week, to identifying unusual behavior, such as coordinated information operations or currency manipulation efforts. To evaluate a new approach to forecasting, it is important to have baselines against which to assess performance gains. We experimentally evaluate the performance of four baselines for forecasting activity in several social media datasets that record discussions related to three different geo-political contexts synchronously taking place on two different platforms, Twitter and YouTube. Experiments are done over hourly time periods. Our evaluation identifies the baselines which are most accurate for particular metrics and thus provides guidance for future work in social media modeling.


Introduction
The importance of baselines for scientific progress cannot be overestimated. In their absence, the evaluation of substantive theoretical models is compromised: exact and accurate fit is rare, if not impossible, and so whether the fit is "good enough" is left to the eye of the beholder. It is much better to have an understanding of how much can be accomplished by relatively simple and widely utilized models of the process of interest and then compare the value added by one's substantive theoretical model. These simple and widely used models are "baselines." In certain circumstances, they can themselves be successful at capturing key regularities at relatively low cost in terms of parameters estimated [1].
In a new domain of phenomena, like that of social media timeseries, baselines are even more important. And in such a new domain, it is worth exploring multiple baseline candidates since it is not a priori known which baseline does the best job at capturing key regularities. Indeed, it is possible, as we show, that which baseline does best depends on which key regularities one believes it is essential to capture. In the next section, we introduce three different baselines for social media time series: ARIMA [2][3][4][5][6], Hawkes process [7][8][9][10][11][12][13][14], and the Shifted Baseline [6,15,16,17,18,19]. They have each been widely used in the field of time series analysis.
Our objective is to use data streams from multiple social media platforms to assess which of these baselines has better predictive power against the empirical data, the so-called ground truth (GT) data streams, for capturing key properties (e.g., volume of activities, temporal patterns, distribution characteristics). Social media platforms function as an attention economy [20,21] in which providers of content (both individuals and organizations) in effect compete for the market share of attention. In general, there is reason to believe that activities in attention economies are amenable to analysis and prediction [22][23][24].
Via experimental evaluation from six independent datasets (three geo-political contexts over two different social media platforms) and a total of 30 time series of topics in social media, we show that each baseline has advantages and limitations based on the end-task and particular time series characteristics that we aim to capture. Two prediction time periods, daily and weekly, are examined. We observed that, for example, ARIMA better estimates the overall volume of activities over a long forecasting horizon, but it fails to produce accurate temporal patterns/signals present in the GT. When the detection of anomalous or irregular periods of activity is of interest, comparing our predictions against ARIMA could be problematic and might lead to misleading conclusions. For these scenarios, Hawkes processes can be a more appropriate benchmark. Simple baseline approaches that consider shifting past observations as the forecast proved to be highly competitive and a strong benchmark for many scenarios, especially, for next day forecasts. However, when topics are heavily influenced by external variables (e.g., real-world events, economic fluctuations, emergency responses, etc.), the shifted baseline is likely to be less useful over longer prediction windows. Finally, we show that ensemble approaches could provide guidance for better benchmarks.

Traditional baselines for time series
What is a competitive baseline against which to compare approaches for predicting time series for social media activity? This question was motivated by our work on the DARPA SocialSim project with a focus on forecasting social media activity in selected contexts at micro-level granularity [6,15,16,23,24,25,26,27]. In the DARPA challenges, our forecasts were evaluated against what we call below the Shifted Baseline, while reviewers of our work submitted for publication consistently suggested a comparison with ARIMA models. This paper addresses this disjuncture systematically. We selected three well-known time-series generators used in different applications, and applied them to three datasets from two social media platforms each. The three baselines typically used for time series forecasting are briefly presented below. In addition, we combine two of these baselines, ARIMA and Hawkes, into an ensemble that simply averages their outputs in an attempt to obtain a better baseline.

ARIMA
ARIMA is a traditional approach to modeling dynamical systems. This model is derived from a composition of autoregressive and moving average modeling processes, whose orders are commonly represented by p and q, with the addition of a differencing term (I) whose order is represented by d. In short, ARIMA assumes that data are linear and follow a particular probability distribution. That is, any observed variable X can be decomposed into fixed trend, seasonal, and irregular components over time. The output of the ARIMA model can be expressed as a linear combination of past values of $X_t$ and lagged forecast errors $\epsilon_t$:
$$X_t = \mu + \sum_{i=1}^{p} \varphi_i X_{t-i} + \sum_{j=1}^{q} \theta_j \epsilon_{t-j} + \epsilon_t$$
where $\varphi_1, \dots, \varphi_p$ and $\theta_1, \dots, \theta_q$ are the regression weights to be estimated, $\mu$ represents the constant trend component, and $\epsilon_{t-1}, \dots, \epsilon_{t-q}$ are the random errors. Despite suffering from limitations when presented with non-linear tasks due to its linearity assumption [2], ARIMA is widely accepted and used as a base model in most time series research [3][4][5][6].
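To make the linear combination above concrete, the following is a minimal sketch of a single one-step-ahead forecast, assuming the weights have already been estimated. The function name `arima_one_step` and its arguments are illustrative only and are not taken from the paper or from any library; the residuals are assumed to be on the differenced scale.

```python
def arima_one_step(series, residuals, mu, phi, theta, d=0):
    """One-step-ahead forecast from already-estimated ARIMA weights.

    series:    observed values, most recent last
    residuals: past one-step forecast errors (differenced scale), most recent last
    mu:        constant term; phi / theta: AR and MA weights (lag 1 first)
    d:         order of differencing
    """
    work = list(series)
    tails = []  # last value at each differencing level, needed to undo the differencing
    for _ in range(d):
        tails.append(work[-1])
        work = [b - a for a, b in zip(work, work[1:])]

    # Autoregressive part: weighted sum of the most recent p values.
    ar = sum(w * x for w, x in zip(phi, reversed(work)))
    # Moving-average part: weighted sum of the most recent q forecast errors.
    ma = sum(w * e for w, e in zip(theta, reversed(residuals)))
    forecast = mu + ar + ma

    # Integrate back: add the last value of each differencing level.
    for last in reversed(tails):
        forecast += last
    return forecast


# Hypothetical weights and data, for illustration only.
level_forecast = arima_one_step([10.0, 12.0], [0.5, -1.0], mu=1.0,
                                phi=[0.5, 0.2], theta=[0.3])
trend_forecast = arima_one_step([1.0, 3.0, 6.0], [], mu=0.0,
                                phi=[1.0], theta=[], d=1)
```

A multi-step forecast would repeat this step, feeding each forecast back in as a new observation with a zero residual; in practice the weights would come from a fitted model rather than being set by hand.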

Hawkes processes
Hawkes processes have been applied in a variety of settings in order to describe or predict univariate or multivariate data. Unlike discrete-time models (in which temporal data is aggregated into uniform time intervals), Hawkes processes rely on discrete events in continuous time (that is, with exact timestamps) to model the likelihood of an event occurring in the future. They are based on the idea that events that are observed in time frequently cluster together. For example, a post published by an influential social media user is likely to be followed by a significant number of interactions from other social media users. Hawkes processes model the inter-arrival timing of these events/interactions by using a conditional intensity function, which can be influenced by past/historical events. Hawkes processes can be defined as follows:
$$\lambda(t) = \mu + \sum_{t_i < t} \phi(t - t_i)$$
where $\lambda(t)$ is the conditional intensity at time $t$, $\mu$ is the base intensity rate, $t_i$ are the timestamps of past events, and $\phi$ is a triggering kernel function that takes as input the delay between the current and previous timestamps of events. In this paper, we used the exponential decay kernel function defined by $\phi(x) = \alpha\beta \exp(-\beta x)$, where $\alpha$ and $\beta$ regulate the growth and decay rate of events based on observed data. Hawkes processes were originally defined to describe earthquake dynamics [7], but previous work has used them for forecasting tasks such as estimating price fluctuations in stocks [8] or reproducing conversation dynamics as event sequences [9]. In social media, Hawkes processes have been adopted to model information diffusion and to predict the popularity of content [10][11][12][13]. Despite some success, they often struggle to capture characteristic patterns usually observed in social media discussions, such as topics with low volume of activities [14].
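As an illustration of how past events raise the intensity under the exponential kernel, the following minimal sketch evaluates the conditional intensity at a given time. The function names and the parameter values are hypothetical and chosen only for the example; they are not the fitted values from our experiments.

```python
import math


def exp_kernel(delay, alpha, beta):
    """Exponential triggering kernel: alpha * beta * exp(-beta * delay)."""
    return alpha * beta * math.exp(-beta * delay)


def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Base rate plus the decayed contribution of every event before time t."""
    return mu + sum(exp_kernel(t - t_i, alpha, beta)
                    for t_i in event_times if t_i < t)


# Hypothetical event timestamps (in hours) and parameters.
events = [0.0, 0.5, 0.6]
soon_after = hawkes_intensity(0.7, events, mu=0.5, alpha=0.8, beta=2.0)
much_later = hawkes_intensity(5.0, events, mu=0.5, alpha=0.8, beta=2.0)
```

Shortly after a cluster of events the intensity is well above the base rate, and it relaxes back toward the base rate as the delay grows, which is the clustering behavior described above.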

Shifted baseline
The Shifted (or Persistence) Baseline model is the third common baseline approach we investigate. It is defined as follows. Let T be the current timestep of interest, and let S be the length of the desired predicted time series. The Shifted Model predicts the time series for period T + 1 to T + S by moving forward, or "shifting forward", the time series that spans period T - S + 1 to T. The underlying assumption of this model is that the immediate future of the time series will approximately replicate its immediate past.
The Shifted baseline has been used in many domains: in social media time series prediction [6,15,16], in healthcare time series prediction (patients' weekly average expenditures on certain pain medications) [17], and in weather time series forecasting [18,19].
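The definition of the Shifted Baseline amounts to copying the last S observations forward. A minimal sketch follows; the function name `shifted_forecast` is illustrative, and the series is assumed to be a list whose last element is the observation at timestep T.

```python
def shifted_forecast(series, horizon):
    """Forecast T+1..T+S by replaying the observations from T-S+1..T."""
    if horizon <= 0 or horizon > len(series):
        raise ValueError("horizon must be between 1 and len(series)")
    return list(series[-horizon:])


# Hypothetical hourly counts; forecast the next 3 hours.
history = [4, 7, 1, 0, 9, 3]
next_three = shifted_forecast(history, 3)
```

For a one-week forecast at hourly granularity, `horizon` would be 168, so each predicted hour repeats the value observed exactly seven days earlier.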

Ensemble Hawkes and ARIMA
The quantitative evaluations presented later suggested a combination of the Hawkes processes approach and the ARIMA approach in an ensemble could be useful. As shown in the next section, Hawkes processes offer better approximations of fluctuations in activity over time, while ARIMA offers better approximations of the amplitude of activity. Therefore, these results suggest an ensemble of the two approaches might prove valuable. Our ensemble is simple: it averages the two source predictions over time, assuming that each approach contributes equally to the final predicted value.
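The equal-weight averaging can be sketched in a few lines. The function name and the example values below are hypothetical; both inputs are assumed to be hourly forecasts covering the same period.

```python
def ensemble_average(hawkes_forecast, arima_forecast):
    """Average two forecasts of equal length, hour by hour, with equal weights."""
    if len(hawkes_forecast) != len(arima_forecast):
        raise ValueError("forecasts must cover the same time steps")
    return [(h + a) / 2 for h, a in zip(hawkes_forecast, arima_forecast)]


# Hypothetical hourly forecasts from the two source models.
combined = ensemble_average([2, 4, 10], [4, 0, 6])
```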

Datasets
We use data from three different geo-political contexts, each with their own different reactions on social media. The datasets contain a record of user activity on two social media platforms, Twitter and YouTube. The focus on multiple datasets from different contexts and platforms attempts to capture regularities attributable to a baseline rather than features related to data or a social media platform.

Three geo-political contexts
Our datasets include a period of political unrest in Venezuela in early 2019, discussions around the Chinese-Pakistan Economic Corridor, and discussions around the Chinese-East Africa Development Project.
Venezuelan Political Crisis of 2019 (Vz19). In 2019, a humanitarian, economic and political crisis engulfed Venezuela as the presidency was disputed between Juan Guaidó and Nicolás Maduro, each claiming to be the country's rightful president. The political turmoil had its roots in the controversial re-election of Nicolás Maduro as the head of the state on January 10th, which was boycotted by opposition politicians and condemned by the international community as fraudulent. Consequently, on January 23rd, the opposition-led National Assembly ratified Juan Guaidó as the country's interim president in an effort to restore democracy and constitutional rights. This event contributed to unprecedented conflicts and high political tension which resulted in nationwide protests, militarized responses, international humanitarian aid intervention attempts, and incidents of mass violence and arrests. Our dataset covers two major events unfolding in the country. The first is the declaration of Juan Guaidó as Venezuela's interim president, which triggered massive anti-government protests across the country. The second is the arrival of humanitarian aid for the country, which was met with violent standoffs between protesters and military forces. Our models are trained on data from the first event and evaluated on data from the second.
Chinese-Pakistan Economic Corridor (CPEC). CPEC is a strategic economic project under the Belt and Road Initiative (BRI) launched by China aimed at strengthening and modernizing road, port and energy transportation systems in Pakistan. While China's investment in Pakistan has largely taken the form of infrastructure development, the project has been heavily criticized with claims of lack of transparency, self-benefit and imposing unsustainable debt on the South Asian country [28]. For example, the Indian government firmly stands against CPEC as the project involves traversing parts of the disputed Jammu and Kashmir regions [29]. Conflict around this project plays out in discussions on social media, where it has been reported that both Pakistani and Chinese state actors strategically promote Chinese interests to facilitate building projects and garner further investment. On the other hand, factions opposed to CPEC perpetuate unfavorable narratives such as the project only being beneficial for China to boost its influence and trading power [30,31].
Belt and Road Initiative in East Africa (BRIA). As part of the same BRI flagship project, China has committed a substantial amount of resources and investment across Africa, especially in the east African nations. Similar to the CPEC scenario, several concerns have arisen over China's strategic intentions in Africa. Chinese officials claim BRIA is a purely economic construct to enhance international cooperation. However, other parties suggest that the project is a geopolitical tool to boost China's global strategic influence [32]. Due to the heavy influence of Chinese media in Africa [33], BRIA activities have been promoted extensively on the continent through public pronouncements from ambassadors and government officials. At the same time, many citizens have openly expressed resistance to China's BRIA as they believe it creates an era of external dependency, internal conflicts, and a rising debt trap [32].

Twitter and YouTube data collection
The datasets used in this work were collected by Leidos and are part of the DARPA-sponsored SocialSim program [34]. Twitter data was collected using the GNIP API and YouTube data was collected through its public API query tool. The collection was based on a list of keywords relevant to each context and was compiled by our SocialSim collaborators. Table 1

Topic assignment
As an effort to investigate how activity/information diffuses for different topics of discussion in the social media ecosystem, Leidos employed annotators and subject matter experts (SMEs) to identify a set of relevant topics relating to each of the three contexts. After a thorough exploration of the dataset corpus, the SMEs crafted a total of 18 topics for Vz19, 12 topics for CPEC and 12 topics for BRIA. For each of the three contexts, the annotators created initial annotation sets comprising a small subset of tweets and YouTube posts (i.e., videos and comments), which were then labeled with their corresponding topics. The annotations were conducted manually and over a subsample of 11,218 distinct text documents for each of the three contexts. The annotation process consisted of an 8 to 1 ratio of single annotator annotations to all annotator annotations. The 18 topics identified in Vz19 had a weighted average Cohen's Kappa score of 0.64 for inter-annotator agreement. The 12 topics in CPEC had inter-annotator agreement of 0.74 for Cohen's Kappa. Lastly, the 12 topics in BRIA had an average Cohen's Kappa score of 0.65. Due to the substantial variation in agreement scores among individual topics, we decided to focus this work on the 5 topics with the highest inter-annotator agreement scores from each context. Table 2 presents the topics chosen alongside their agreement scores and a brief description. Because it is not feasible to manually label millions of messages, BERT models [35] were trained and tested for topic annotation. The BERT models for each context were trained on 10,097 distinct text documents and evaluated on a 10% test set (1,121 texts). Stratified sampling was used to ensure that the train and test sets had approximately the same percentage of samples of each topic class as in the original manually annotated corpus. The F1 scores of the BERT models on the Vz19 and CPEC annotation test sets were 0.66 and 0.73, respectively. For BRIA, an XLM-RoBERTa-Large pre-trained model [36] was used instead to label the texts. Its F1 score was 0.83 on the annotation test set.
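For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of Cohen's Kappa for two annotators labeling the same items. It is the textbook two-annotator form and is only an assumption about how per-topic agreement could be computed; the exact procedure used by the annotation team (including the weighting across topics) is not reproduced here. The labels in the example are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random with their own label rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical topic labels from two annotators on four documents.
kappa = cohens_kappa(["protests", "protests", "aid", "aid"],
                     ["protests", "aid", "aid", "aid"])
```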
After topic assignment, we consider only the top 5 topics in inter-annotator agreement for each context (see Table 2). Table 2 also shows the number of activities broken down per topic in both Twitter and YouTube. We observe that both platforms exhibit clear differences in terms of the magnitude of activities. Topic activity on Twitter is orders of magnitude larger than YouTube activity in every context. This is expected as YouTube is mainly used for audiovisual media consumption rather than micro-blogging, where large-scale user interactions are more common. We also find differences across contexts. For instance, the Venezuelan political crisis in 2019 (Vz19) spawned much larger discussions, compared to CPEC and BRIA, on both social media platforms as different external events unfolded in the country.

Evaluation
Our objective is to assess how well different baselines can capture key properties of time series in the context of per-topic social media activity. To this end, we evaluate the baselines described above against real data and across a set of relevant performance metrics.

Experimental setup
We evaluate performance in forecasting social media activity on two platforms, Twitter and YouTube, on a number of topics from three different geo-political contexts. We split the time series samples into training, validation and testing sets for each context as shown in Table 3. The baselines forecast one week of social media activity at hourly granularity, that is, 168 data points corresponding to the 168 hours in a week. For Hawkes, we use the sequence of previous event timestamps in order to estimate an intensity function that approximates the likelihood of events happening in the future. The model parameters are: (1) the intensity rate, μ, (2) the infectivity factor, α, which regulates the growth of events triggered by past events, and (3) the exponential decay, β, which shapes the probability distribution of the time between events. These parameters are fit from the training data using an expectation-maximization algorithm that seeks to maximize the log-likelihood function. The fitted model generates timestamps of future events, which we aggregate into discrete-time intervals for evaluation (in our case, hourly granularity). For ARIMA, we use the validation set to select a combination of optimized parameters (p, q, d) for each topic. That is, each hour of predicted activity per topic is dependent on the previous p hours of activities and the previous q hours of estimation errors. The differencing factor d indicates the order of differencing applied to transform a non-stationary time series into a stationary one.
We ran a grid search over different combinations of (p, q, d). At hourly granularity, we considered 24, 48, 72, and 96 previous hours for both p and q values. Values of 0, 1 and 2 were considered for d. Table 4 shows the parameter combinations that obtained the lowest RMSE score over the validation period for each topic.
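The grid search can be sketched as follows. The search loop mirrors the grid described above, but `fit_and_forecast` is a hypothetical callable standing in for whatever routine fits ARIMA on the training data and forecasts the validation period; the `toy_forecaster` used here is a deliberately trivial stand-in so that the sketch runs on its own, not an ARIMA implementation.

```python
import itertools
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def grid_search(train, valid, fit_and_forecast,
                p_values=(24, 48, 72, 96),
                q_values=(24, 48, 72, 96),
                d_values=(0, 1, 2)):
    """Return the (p, q, d) with the lowest validation RMSE, and that RMSE."""
    best_score, best_params = math.inf, None
    for p, q, d in itertools.product(p_values, q_values, d_values):
        forecast = fit_and_forecast(train, len(valid), p, q, d)
        score = rmse(valid, forecast)
        if score < best_score:  # ties keep the first combination encountered
            best_score, best_params = score, (p, q, d)
    return best_params, best_score


def toy_forecaster(train, horizon, p, q, d):
    """Stand-in model (assumption): repeats the mean of the last p values."""
    window = train[-p:]
    return [sum(window) / len(window)] * horizon


# Hypothetical data, for illustration only.
train = [float(v) for v in range(100)]
valid = [87.5, 87.5, 87.5]
best_params, best_score = grid_search(train, valid, toy_forecaster)
```

With 4 values each for p and q and 3 for d, the loop evaluates 48 combinations per topic, which is consistent with ARIMA dominating the runtime costs reported later.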
We report six performance metrics. Here, $y_t$ indicates the true value of the time series and $\hat{y}_t$ the forecast series at time step $t$, and $n$ is the number of time steps in the forecasting period. $Y_t$ and $\hat{Y}_t$ represent the aggregated volume of activities for the ground truth and the forecast, respectively:
1. Absolute percentage error (APE), which evaluates the error over the aggregated volume of activities within the forecasting period,
$$\mathrm{APE} = \frac{|Y_t - \hat{Y}_t|}{Y_t} \times 100$$
2. Root mean squared error (RMSE), which measures the differences between predicted and actual values over time,
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (y_t - \hat{y}_t)^2}$$
3. Symmetric mean absolute percentage error (sMAPE), which measures relative errors over time, given here in its common form,
$$\mathrm{sMAPE} = \frac{1}{n} \sum_{t=1}^{n} \frac{|y_t - \hat{y}_t|}{(|y_t| + |\hat{y}_t|)/2}$$
4. Dynamic time warping (DTW), for measuring the similarity between two time series that may vary in timing. DTW takes as input two time sequences, X and Y, of length n, where $X = \{x_1, \dots, x_n\}$ and $Y = \{y_1, \dots, y_n\}$ and computes the matching cost based on the following formulation,
$$D(i, j) = d(x_i, y_j) + \min\{D(i-1, j),\; D(i, j-1),\; D(i-1, j-1)\}$$
where d is a distance function (e.g., Euclidean distance), and consequently,
$$\mathrm{DTW}(X, Y) = D(n, n)$$
5. Volatility error (VE), for measuring the absolute difference between the standard deviation ($\sigma$) of the ground truth and forecasted time series,
$$\mathrm{VE} = |\sigma(y) - \sigma(\hat{y})|$$
6. Skewness error (SkE), for measuring asymmetry in the volume of activities over time. The SkE metric is measured by computing the adjusted Fisher-Pearson standardized moment coefficient of both the ground truth and forecast time series,
$$G_1 = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3$$
where n is the length of the time series, $\bar{x}$ is the sample mean and s is the sample standard deviation, and then calculating their absolute difference,
$$\mathrm{SkE} = |G_1 - \hat{G}_1|$$
where $G_1$ and $\hat{G}_1$ are the standardized moment coefficients for the ground truth and forecast, respectively.
We selected these metrics for evaluation as they measure specific characteristics of time series that we believe are of relevance to social media forecasting solutions. First, we use APE to assess how well a model can capture the total volume of activities per topic. This property is important for scenarios where the identification of viral/popular topics across social media platforms is of interest. For temporal patterns and volume of activity over time, we use DTW, RMSE and sMAPE measures. The temporal characteristics of time series are valuable for identifying anomalous periods of activity and making informed decisions based on predictions. Finally, we measure skewness and volatility errors, which are important characteristics that can describe social media time series distributions in terms of their shape and variability. The authors of [37] noted the importance of these two particular metrics because of instances in which a model reported strong performance using more typical metrics such as RMSE, but was weak at capturing variability or "spikiness" upon visual inspection.
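The six metrics can be sketched in plain Python as follows. Several details are assumptions made for this sketch rather than statements about our implementation: APE is expressed as a percentage, sMAPE uses the mean of absolute values in the denominator and counts a term as 0 when both values are 0, the DTW distance is the absolute difference, VE uses the population standard deviation, and the skewness coefficient uses the sample standard deviation.

```python
import math
import statistics


def ape(y_true, y_pred):
    """Absolute percentage error over the aggregated volume."""
    total_true, total_pred = sum(y_true), sum(y_pred)
    return abs(total_true - total_pred) / total_true * 100


def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def smape(y_true, y_pred):
    terms = []
    for a, b in zip(y_true, y_pred):
        denom = (abs(a) + abs(b)) / 2
        terms.append(abs(a - b) / denom if denom else 0.0)
    return sum(terms) / len(terms)


def dtw(x, y, dist=lambda a, b: abs(a - b)):
    """Dynamic time warping cost via the standard cumulative-cost recursion."""
    n, m = len(x), len(y)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = dist(x[i - 1], y[j - 1]) + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def volatility_error(y_true, y_pred):
    return abs(statistics.pstdev(y_true) - statistics.pstdev(y_pred))


def adjusted_skewness(x):
    """Adjusted Fisher-Pearson coefficient; needs n >= 3 and non-constant data."""
    n, mean, s = len(x), statistics.fmean(x), statistics.stdev(x)
    return n / ((n - 1) * (n - 2)) * sum(((v - mean) / s) ** 3 for v in x)


def skewness_error(y_true, y_pred):
    return abs(adjusted_skewness(y_true) - adjusted_skewness(y_pred))
```

Note that DTW returns 0 for two series that differ only by a local stretch, such as `[1, 2, 3]` and `[1, 2, 2, 3]`, whereas RMSE is not even defined for series of different lengths; this is the sense in which DTW tolerates variation in timing.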

Hardware and runtime costs
All experiments were run on computers with Intel Xeon E5-2630 v4 CPUs. Each computer had two sockets, with 10 cores per socket and 2 threads per core, and 512 GB of memory. The ARIMA-based models took roughly 12 hours per topic. We ran ARIMA experiments in parallel across 5 computers, and ended up with a total training time of 36 hours for all topics. Hawkes processes took on average 2 minutes for each topic and a total run time of 1.5 hours for all topics. The shifted baseline is trivially created by moving historical observations forward; thus, there is no training phase involved.

Empirical results
In comparing the performance of the baselines against the ground truth, we answer the following questions: 1. Which baselines better capture the volume of activity? 2. Which baselines better capture the temporal patterns? 3. Which baselines better capture the skewness and volatility? 4. How does the duration of the forecast window affect the performance of the baselines? A high level overview of results is found in Table 5 which presents the performance of baselines for predicting time series of activities across social media topics over different metrics, contexts and platforms at hour granularity. Highlighted in bold are the scores for the baseline that performed best for a metric and topic for a platform. Before we discuss the results in detail, we want to emphasize that they vary significantly by topic and by platform. Thus, we try to identify consistent patterns, but stress the patterns hold in most but not all cases.
We then utilized an aggregate metric, called the Overall Normalized Metric Error (ONME) [37] that better illustrates relative performance among models. Tables 6 and 7 contain the results of this metric. Utilizing this new metric, we find that the Shifted Baseline seems to be the strongest performing baseline.

Volume of activity
APE and sMAPE are measures for the volume of activity. Figure 1 presents the APE and Fig. 2 the sMAPE per topic for Twitter and YouTube. We make the following observations. First, there is consistency across topics within the same context, but not across contexts with regards to predicting the total volume of activities. For example, in CPEC, ARIMA performs poorly across most topics on the two platforms (Figs. 1c and 1d), yet in Vz19 it is best at predicting the total volume of activity overall (2 out of 5 topics in Twitter and 4 out of 5 topics in YouTube). Another example is that Hawkes processes perform the worst on all Vz19 topics in YouTube (Fig. 1b). A third example is the shifted baseline which performs the best overall in CPEC (6 out of 10 topics) and BRIA (7 out of 10 topics). One reason for the good performance of the shifted baseline in CPEC/BRIA (or, conversely, its poor performance for Vz19 topics) is the context of our datasets: social media activity in Vz19 (Venezuela political unrest) is strongly correlated to exogenous activity (such as street protests, power outages, political declarations), thus simply replaying the past does not capture potential reactions to current events. Moreover, the shifted baseline replays the actions from seven days before, while Hawkes processes and ARIMA at least take into account recent history. Second, we note that ARIMA performs consistently better than other baselines in sMAPE for 12 out of 15 topics on YouTube (as shown in Figs. 2b, 2d, 2f). For low-activity topics, as is the case for most topics in YouTube, ARIMA can more accurately predict the volume of activities per hour while baselines such as Hawkes or the Shifted baseline often overpredict. On the other hand, for Twitter, there is not a persistent winner across most topics using sMAPE. Yet ARIMA is rarely ranked as the worst performing baseline in this metric (only 3 out of 15 topics in total).
Table 5 Forecasting performance for total activity predicted across topics in each domain and platform. Baselines were set to forecast one week of activity at hour granularity (168 hours). To compute each metric, we aggregated the predicted activity across all topics and compare with the total number of activities in ground truth. Thus, the performance results are implicitly weighted by the number of activities in each topic. Bold entries are for the best aggregate metric

Temporal patterns
Visual inspection of the time series in Figs. 3 and 4 shows that in many instances ARIMA produces a highly unrealistic pattern of variation that diminishes the seeming advantage in performance discussed above. While in some cases ARIMA captures well the periodicity in the ground truth data at hour granularity (as in, for example, Figs. 3a, 3b and 3g for Twitter or Figs. 4a, 4b and 4c for YouTube), it either misses the amplitude of the signal by an order of magnitude (as shown in Fig. 3e), or creates a highly regular (and unrealistic) pattern (Fig. 3a). In other cases, however, it predicts a linear time series that is highly unrealistic, as in Fig. 3d.
Yet despite these visible limitations, traditional time series error metrics such as RMSE show a more favorable picture of ARIMA when compared to both Hawkes and the shifted baselines. For example, in Table 5, which shows the weighted performance over all topics in a context/platform, ARIMA appears to be the best performing baseline on RMSE (5 out of 6 possible instances). This observation also holds when evaluated per topic, where ARIMA outperforms other baselines in 23 out of 30 instances (as shown in Fig. 5). This contradiction emphasizes the importance of model performance evaluation across additional dimensions. For instance, if we desire to accurately capture patterns over time, then metrics that measure similarity between temporal sequences (e.g., DTW) might be a more appropriate alternative. From Fig. 6, we observe that ARIMA often fails to outperform other baselines, especially Hawkes, with regards to DTW as in many cases it produces unrealistic temporal sequences. This observation excludes ARIMA as a meaningful methodology for creating representative social media time series patterns for scientific purposes.
Figure 2 sMAPE hourly performance across topics for each context and over two platforms (lower is better). sMAPE scores for each topic are normalized between 0 to 1 relative to the sum of the baselines' errors. The results are for one week predictions at hourly granularity

Skewness and volatility
Despite ARIMA's unrealistic predictions, we noticed that traditional time series forecasting metrics fail to penalize it enough. Relying only on these particular metrics to make decisions on which baseline to select can be misleading. In fact, a representative baseline for social media time series is one that not only produces realistic activity levels over time but also captures well key properties related to the distribution of data. Figures 7 and 8 show baselines performance on capturing two basic characteristics of a distribution: spread (in terms of volatility error) and shape (in terms of skewness).
While in some topics ARIMA captures volatility well (e.g., "china/border" in CPEC or "travel" in BRIA as shown in Figs. 7c and 7e), on the majority of topics it is the worst across different contexts and platforms. From time series visualizations, it is evident that ARIMA completely misses the variations observed in the ground truth time series data. For example, in highly volatile topics such as "controversies/china/border" (Fig. 3d) or "leadership/sharif" (Fig. 3e), ARIMA predicts smooth time series with no signs of volatility. On the other hand, Hawkes and Shifted baselines do a better job at capturing rapid changes despite sometimes exhibiting irregular up-and-down trends. Similarly, ARIMA does poorly on capturing the shape of the data distribution based on its symmetry (Fig. 8). These patterns become clearer when looking at the aggregated results on volatility and skewness across topics in Table 5, where ARIMA's limitations on distribution-based metrics are more evident. Finally, we might want to select relevant baselines as informed by the utility of time series predictions: should we compare a particular time series against different baselines for different performance metrics? For example, consider ARIMA as a relevant baseline for volume, but another baseline for capturing representative patterns over time. To explore this question, we experimented with an ensemble approach (Hawkes + ARIMA), which takes advantage of both ARIMA's volume predictions and Hawkes' temporal patterns to possibly produce more accurate predictions. Overall the ensemble approach exhibits well-rounded performance across different metrics (Table 5), and proves promising for improving in some topics.
Figure 5 RMSE hourly performance across topics for each context and over two platforms (lower is better). RMSE scores for each topic are normalized between 0 to 1 relative to the sum of the baselines' errors. The results are for one week predictions at hourly granularity

Overall normalized metric error analysis
We sought to have a clearer understanding of which models perform the best, so to that end, we utilized an aggregate metric called the Overall Normalized Metric Error, previously used in [37]. It combines the 6 metrics shown in Table 5 into 1 metric per model. Intuitively, it shows how each model performed relative to the other models.
Figure 6 DTW hourly performance across topics for each context and over two platforms (lower is better). DTW scores for each topic are normalized between 0 to 1 relative to the sum of the baselines' errors. The results are for one week predictions at hourly granularity
Tables 6 and 7 contain the ONME results for each (domain, platform) pairing for each model. Table 6 contains both the ONME results as well as the values (called "normalized errors") used to calculate them. Table 7 contains only the ONME values from Table 6 by themselves, for easier viewing and interpretation of results.
For a given (domain, platform) pairing, the ONME was calculated in the following way.
1. For each model, we calculated a normalized error score. This was done by taking the metric result for a particular model and metric, and then dividing that metric result by the sum of all the models' metric results (including its own) for the current (domain, platform) pair of interest. We indicate the normalized error for each metric using the prefix "n" (e.g. nAPE). These normalized errors are shown in Table 6. For example, for (Vz19, Twitter), we calculated ARIMA's normalized APE score by taking its raw APE from Table 5 (35.54), and then dividing it by the sum of all the models' APE scores for (Vz19, Twitter) from Table 5. So, the sum of all APEs would be 35.54 + 74.61 + 67.88 + 55.07, which is 233.1. We then divided 35.54 by this number, which yielded ARIMA's normalized APE score (nAPE) for (Vz19, Twitter), which was about 0.1525.
2. By repeating this process for each model and metric, we obtained 6 normalized metric values per model, each between 0 and 1, for a given (domain, platform) pair. Intuitively, each normalized metric score indicates the relative performance of a model on a particular metric, in comparison to the other models. Note that, because each error is divided by the sum of that metric's errors over all models, the sum of each row in Table 6 adds up to 1.
3. Finally, for each model, we calculated the ONME score by taking the average of its 6 normalized metric values; in the example, this average is 0.244, which is the ONME value shown in Table 6.
Figure 7 Volatility hourly performance across topics for each context and over two platforms (lower is better). Volatility scores for each topic are normalized between 0 to 1 relative to the sum of the baselines' errors. The results are for one week predictions at hourly granularity
As previously mentioned, Table 7 contains only the ONME results from Table 6, for easier viewing and interpretation. It also adds a new row that shows the average ONME for each model across all domains and platforms.
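The ONME calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and in the example only ARIMA's APE value is attributed in the text, so the other three model labels are placeholders.

```python
from statistics import mean


def normalized_errors(raw_errors):
    """Step 1: divide each model's raw error by the sum over all models."""
    total = sum(raw_errors.values())
    return {model: err / total for model, err in raw_errors.items()}


def onme(raw_by_metric):
    """Steps 2-3: normalize every metric, then average per model."""
    normed = {metric: normalized_errors(errs)
              for metric, errs in raw_by_metric.items()}
    models = list(next(iter(raw_by_metric.values())))
    return {m: mean(normed[metric][m] for metric in normed) for m in models}


# APE values for (Vz19, Twitter) quoted in the text. Only ARIMA's value is
# attributed there; "model_2" to "model_4" are placeholder labels.
ape = {"ARIMA": 35.54, "model_2": 74.61, "model_3": 67.88, "model_4": 55.07}
nape = normalized_errors(ape)
print(round(nape["ARIMA"], 4))  # 0.1525
```

Passing all 6 metrics to `onme` as a dictionary of the form `{metric: {model: raw_error}}` would yield one ONME value per model for a given (domain, platform) pair. Because each metric's normalized errors sum to 1 across models, the ONME values also sum to 1 across models.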

Figure 8
Skewness hourly performance across topics for each context and over two platforms (lower is better). Skewness scores for each topic are normalized between 0 and 1 relative to the sum of the baselines' errors. The results are for one-week predictions at hourly granularity.
As one can see in the table, the Shifted Baseline has the best overall performance. Out of the 6 (domain, platform) pairings, it had the lowest ONME three times. In contrast, the Hawkes, ARIMA, and Hawkes + ARIMA models won in this regard only once each.
Furthermore, the Shifted model had the lowest average ONME across all domains and platforms, with a score of 0.2189. In second, third, and fourth place were the Hawkes + ARIMA, Hawkes, and ARIMA models, with ONMEs of 0.2334, 0.2368, and 0.3109, respectively. It is notable that ARIMA performed much worse than the other models in this regard, given that it is such a commonly used time series baseline.

Forecast duration vs. baseline performance
One of our objectives is to understand how the baselines perform under different prediction window durations. Towards this, we developed a new set of models that predict just one day of activity at the hourly granularity (thus, 24 values, each corresponding to one hour of the day) to contrast with the 7 day predictions just discussed. We refer to these new models as Short-Term, while the previous models are called Long-Term.
For the Short-Term configuration, the Shifted model predictions were created by shifting the previous 24 hours of ground truth into the next day. This was done for all seven days in the test period. All of the Short-Term ARIMA models predicted 24 hours out into the future at a time, for all 7 test day periods. Their p, d, and q hyperparameters were determined using a validation set, in a similar fashion to the Long-Term ARIMA model hyperparameters in Table 4. Table 8 contains the Short-Term ARIMA hyperparameter combinations.
As a reminder, the Long-Term models predict the seven days of activity following the training period at hourly granularity. In the Long-Term configuration, the Shifted model predictions were created by shifting a full 168-hour block of past ground truth into the 168-hour test period, and the ARIMA models predicted 168 hours of values without any ground truth after the start time as input to the model. Instead, the predictions for the previous 24-hour blocks were fed into the model as inputs for the next 24-hour block's prediction. We focus on the Shifted model because it had the lowest (best) Overall Normalized Metric Error, and on the ARIMA model because of its high popularity in the literature [2][3][4][5][6].
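The two Shifted configurations described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the series are plain lists of hourly counts, and the Long-Term variant is assumed to replay the most recent 168 hours of the training period.

```python
def shifted_long_term(history, horizon=168):
    """Replay the last `horizon` observed hours as the forecast
    for the next `horizon` hours (no test-period ground truth used)."""
    return list(history[-horizon:])


def shifted_short_term(history, test_truth, day=24):
    """Forecast each test day by replaying the 24 hours of ground truth
    that immediately precede it (so test-period ground truth is used
    for every day after the first)."""
    series = list(history) + list(test_truth)
    start = len(history)
    forecast = []
    for day_start in range(start, start + len(test_truth), day):
        forecast.extend(series[day_start - day:day_start])
    # Trim in case the test period is not a whole number of days.
    return forecast[:len(test_truth)]


# Toy example: two weeks of history followed by a one-week test period.
history = list(range(336))
test_truth = list(range(336, 504))
long_forecast = shifted_long_term(history)
short_forecast = shifted_short_term(history, test_truth)
```

The key difference is visible in the inputs: the Long-Term forecast depends only on `history`, whereas the Short-Term forecast for day 2 onward reads from `test_truth`.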
Note that we use the terms Short-Term and Long-Term in a relative sense, not an absolute one. The purpose of the following analysis is to better understand how extending the prediction windows of the Shifted and ARIMA models affects their performance relative to one another. The reader may have their own preference for prediction window size; we only want to demonstrate how extending or shortening it may alter relative model performance.
The Fig. 9 heatmap shows the Short-Term prediction results and the Fig. 10 heatmap shows the Long-Term prediction results of the Shifted and ARIMA models. In these heatmaps, if the Shifted Model outperforms the ARIMA model, a cell is green; otherwise, the cell is white. The values in each cell are the percent improvement scores of the Shifted models over the ARIMA models (thus, the larger the value, the better Shifted performs compared to ARIMA in the metric shown on the x axis and on the topic shown on the y axis).
For Short-Term (one-day) predictions (Fig. 9), the Shifted model outperforms ARIMA on most topics and most metrics on both platforms. For the Short-Term Twitter predictions (Fig. 9a), the Shifted model wins 66 out of 90 trials, or 73.33% of the time. Similarly, for YouTube, the Shifted model wins 68 out of 90, or 75.56% of the time (Fig. 9b). APE is the [...]
However, the Shifted model loses its relative advantage over ARIMA in the Long-Term configuration, as shown in Fig. 10. In the Long-Term setup on Twitter data (Fig. 10a), the Shifted model wins only slightly more than 50% of the time. For YouTube (Fig. 10b), the Shifted model outperforms ARIMA 59% of the time.
Moreover, the advantages that ARIMA gained in the Long-Term predictions are mainly in the volume-related metric (APE) and the exact-timing-volume metrics (RMSE and sMAPE); Shifted maintains advantages in the approximate-timing-volume metric (dynamic time warping), as well as in the volatility-related metrics (Skewness Error and Volatility Error). To conclude, for Short-Term predictions, the Shifted baseline is reliable (and in our experience, very challenging to outperform [38]), while ARIMA falls behind on most metrics in all contexts and on both platforms. For Long-Term predictions, however, the two models perform similarly when aggregated over all metrics we analyzed, but with clear advantages in the volume-related and exact-timing-volume-related metrics for ARIMA and in the approximate-timing-volume and volatility-related metrics for Shifted.
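The heatmap cell values and win rates reported in this section can be sketched as below. The text does not give the formula for the percent improvement score, so the relative error reduction used here is an assumption, as are the function names. The decomposition of the 90 trials per platform into 3 contexts, 5 topics per context, and 6 metrics is likewise an inference from the counts given in the text.

```python
def percent_improvement(shifted_error, arima_error):
    """Assumed definition: relative error reduction of Shifted over ARIMA,
    in percent. Positive values mean Shifted has the lower error.
    Undefined when arima_error is 0."""
    return 100.0 * (arima_error - shifted_error) / arima_error


def win_rate(wins, trials):
    """Share of (topic, metric) cells in which Shifted beats ARIMA, in percent."""
    return 100.0 * wins / trials


# Inferred decomposition: contexts x topics per context x metrics.
trials = 3 * 5 * 6
print(round(win_rate(66, trials), 2))  # 73.33 (Short-Term, Twitter)
print(round(win_rate(68, trials), 2))  # 75.56 (Short-Term, YouTube)
```

Under this assumed definition, a cell would be colored green exactly when `percent_improvement` is positive.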

Summary and recommendations
This study was motivated by the desire to understand and document the relative performance benefits of various time series forecasting methodologies for social media activity. For this, we defined the objective of forecasting the number of social media activities per hour within a discussion context for the following 168 hours, without using ground truth information from the forecasting interval. This objective reflects a typical forecasting problem and was the task in the scientific challenges that were part of the DARPA SocialSim project output evaluation. Forecasting social media activity has numerous practical applications, from generating realistic simulations and/or synthetic datasets for scientific experiments to testing intervention techniques meant to control processes on social media.
We chose three representative techniques for forecasting time series: Shifted, which simply replays the past; Hawkes processes; and ARIMA. We used three different conversations from different geo-political contexts, each manifested simultaneously on two social media platforms (Twitter and YouTube), and each comprising five topics. Based on a combination of manual and automated annotation, each message was assigned to one of the topics.
Using these rich datasets, we focused on answering the following question: What is a competitive baseline for forecasting social media activity? We discovered that the answer is nuanced: the Shifted Baseline performed best in terms of Overall Normalized Metric Error, but it was still not "universally the best". Rather, a representative baseline should be chosen based on what the predicted time series will be used for. How should a competitive baseline be chosen for a particular scenario? The choice depends on 1) the duration of the forecast window; and 2) the metrics that are most important for the forecast task, which may in turn depend on the operational scenario.
Specifically, we discovered that ARIMA best estimates the volume of activity when measured in RMSE, but it fails to accurately model temporal patterns. The Shifted baseline works well for short prediction windows in most measurements, but loses power for longer prediction windows. We speculate that Shifted does not match the ground truth when real-world events drive online discussions in a different manner, or when different processes (such as influence campaigns) are taking place on the platforms. Hawkes processes create very agitated time series and never dominate across all measurements, possibly because social media activity is not only self-exciting but also driven by external events that cause new excitement on the platform. There can also be non-transparent platform interventions (such as algorithmic promotion of particular messages on user feeds) that may disrupt the learned processes.
Overall, it appears that the Shifted baseline is the winner: replaying the recent past is a cheap and highly reliable estimation of the near future in many scenarios and performance metrics. Where the Shifted baseline will likely fail is when a topic on social media is heavily influenced by real life events, such as major news, political unrest, protests and armed conflicts. In such cases, ARIMA provides a better estimation of the overall volume of social media posts, although it fails at capturing the variation over time.
The choice of baselines depends on the particular characteristics that the problem attempts to capture/model. For example, if modeling the final volume of information cascades, or predicting next day activity (assuming previous day ground truth is always available), then a baseline like ARIMA could be a good choice. However, if the focus is to capture spiky behavior to identify anomalous periods in the future, for instance, then comparing against baselines such as Shifted or Hawkes might be more meaningful.
An ensemble of baselines could result in more competitive models to compare against. For example, when combining Hawkes and ARIMA by averaging, we observed an improvement for some topics in both volume of activity and temporal patterns.
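As a minimal sketch of such an averaging ensemble, the two forecasts can be combined hour by hour. An unweighted element-wise mean is assumed here; the function name is hypothetical, and the inputs stand in for Hawkes and ARIMA forecasts of equal length.

```python
def average_ensemble(forecast_a, forecast_b):
    """Combine two hourly forecasts by an unweighted element-wise mean
    (assumed form of the averaging ensemble)."""
    if len(forecast_a) != len(forecast_b):
        raise ValueError("forecasts must cover the same number of hours")
    return [(a + b) / 2 for a, b in zip(forecast_a, forecast_b)]


# Toy example with two 3-hour forecasts.
combined = average_ensemble([0, 10, 4], [4, 20, 4])
print(combined)  # [2.0, 15.0, 4.0]
```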
We note that experiments were done at different granularities: in addition to predicting hourly activity, we also predicted daily activity for the same seven-day interval (not included here). Given how the Shifted and Hawkes models work (that is, they predict the exact time when an event will occur, independently of the time granularity chosen), the only difference between daily and hourly granularities is in the aggregation of the performance results over time. Thus, comparing these two methods at different time granularities tells more about the performance metrics chosen to evaluate their quality than about the methods themselves. ARIMA did not show significant differences in terms of volume prediction. Because many of the other metrics are not normalized, a direct comparison is not meaningful.
A final observation from this study is that different performance metrics tell incomplete and different stories. What the minimum optimal set of performance metrics is for evaluating time series forecasts, especially in the context of social media activity, remains an open question.
Future work would include analysis of burstiness as described in [39]. Also of interest would be an analysis of the features of the time series across domains and platforms. This may yield some better understanding as to why certain models perform better or worse on certain time series. Lastly, we would try incorporating the Shifted model into the ARIMA + Hawkes ensemble model to see if better performance could be obtained.